1 Introduction

This assignment examines how incomplete data (the observed data) affect our inferences compared with the inferences we make with complete data (the true data). To investigate the effect that missing values have on model inferences, we build an arbitrarily chosen multiple regression model.

First, we provide descriptive statistics and correlations. In Table 2.1 we compare the head of the observed data and the true data, and in Table 2.2 we compare the means and variances. With regard to correlations, we present two correlation matrices: one for the observed data (Table 2.3) and one for the true data (Table 2.4).

Second, we present our multiple regression model in Table 2.5. The outcome variable is active heart rate and the predictors are age and smoke. We also include an interaction effect between bmi and sex. The first three columns reflect the observed data, whereas the last three columns reflect the true data.

The research question we aim to answer with our model is: What impact do missing values have on the inferences from an “active heart rate” model?

Third, we inspect the missing values and try to find out where they occur. In 3.1 we start by giving a global overview of the missing values. Then in 3.2 we compare the distributions of the observed data and the missing values.

Lastly, we perform t-tests and logistic regressions on the variables that contain missing values to check which type of missingness we are dealing with, i.e. MNAR, MAR or MCAR. We also provide plots to visualize where the missing values occur.



2 Observed vs True data

In this section we will compare the observed with the true dataset.

Table 2.1: Observed Data
age  smoke  sex     intensity  active  rest  height  weight  bmi
42   no     female  high       NA      75    NA      NA      22.4
31   NA     male    low        NA      62    NA      NA      23.8
36   no     male    low        109     76    182     78.0    23.5
31   no     female  low        78      62    164     53.9    20.0
42   no     male    low        NA      66    189     NA      23.4

Table 2.1: True Data
age  smoke  sex     intensity  active  rest  height  weight  bmi
42   no     female  high       94      75    161     58.1    22.4
31   no     male    low        86      62    184     80.6    23.8
36   no     male    low        109     76    182     78.0    23.5
31   no     female  low        78      62    164     53.9    20.0
42   no     male    low        103     66    189     83.6    23.4

2.1 Descriptives

Neither the mean nor the variance of age and rest changed, since these variables have no missing values.

The mean of active is almost entirely unaffected. The variance of active changed slightly in the observed data, but this difference is simply due to sampling variability (we’ve deleted about 40% of the observations). The missing values in active are MCAR, so we would not expect any substantial changes in the marginal distribution of active.

The mean of height is also almost entirely unaffected. The variance of height changed a bit in the observed data, but this difference is simply due to sampling variability (we’ve deleted about 30% of the observations). The missing values in height are MCAR, so we would not expect any substantial changes in the marginal distribution of height.

The mean of weight is also almost entirely unaffected. The variance of weight changed a bit in the observed data, but this difference is simply due to sampling variability (we’ve deleted about 57% of the observations). The missing values in weight are MCAR, so we would not expect any substantial changes in the marginal distribution of weight.

The mean of bmi is also almost entirely unaffected. The variance of bmi changed a bit in the observed data, but this difference is simply due to sampling variability (we’ve deleted about 30% of the observations). The missing values in bmi are MCAR, so we would not expect any substantial changes in the marginal distribution of bmi.

Furthermore, the variances of age and rest are unaffected in the observed data set. The variance of active in the observed data set is 0.01 higher than in the true data set, and thus almost entirely unaffected. However, height, weight, and bmi have greater variance in the true data set than in the observed data set. This suggests that, in this sample, the missingness leads to an underestimation of the variance of these variables.

Table 2.2: Means and variances in true and observed dataset
Variables  \(M_{obs}\)  \(M_{true}\)  \(Var_{obs}\)  \(Var_{true}\)  \(N_{obs}\)  \(N_{true}\)
Age        38.52        38.52         149.73         149.73          306          306
Active     92.58        93.13         383.05         383.04          183          306
Rest       69.83        69.83         120.78         120.78          306          306
Height     174.50       173.99        100.66         105.29          214          306
Weight     73.91        73.58         260.26         274.85          132          306
Bmi        24.11        24.06         12.91          13.38           213          306
Note. obs = Observed Dataset, true = True Dataset
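
The deletion percentages quoted in the text follow directly from the sample sizes in Table 2.2. As a minimal sketch (plain Python, with the counts copied by hand from the table; the names `counts` and `missing_share` are illustrative and not part of the original analysis), the share of missing values per variable can be recomputed as:

```python
# (N_obs, N_true) per variable, copied from Table 2.2.
counts = {
    "age": (306, 306),
    "active": (183, 306),
    "rest": (306, 306),
    "height": (214, 306),
    "weight": (132, 306),
    "bmi": (213, 306),
}


def missing_share(n_obs, n_true):
    """Return the fraction of values that are missing."""
    return (n_true - n_obs) / n_true


shares = {name: missing_share(*n) for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name:>6}: {share:.1%} missing")
```

Rounded to whole percentages, this gives the figures used above: about 40% for active, 30% for height, 57% for weight, and 30% for bmi.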

2.2 Correlations

Table 2.3: Correlations of observed data
                age     smoke       sex intensity    active      rest    height    weight       bmi
age            1.00      0.01     -0.17      0.21     -0.49     -0.39      0.19      0.25      0.18
smoke          0.01      1.00     -0.09     -0.29      0.15      0.23      0.18      0.18      0.18
sex           -0.17     -0.09      1.00     -0.09      0.11      0.06     -0.73     -0.68     -0.42
intensity      0.21     -0.29     -0.09      1.00     -0.37     -0.55      0.13      0.12      0.02
active        -0.49      0.15      0.11     -0.37      1.00      0.56      0.00      0.01      0.05
rest          -0.39      0.23      0.06     -0.55      0.56      1.00     -0.20     -0.12      0.06
height         0.19      0.18     -0.73      0.13      0.00     -0.20      1.00      0.78      0.34
weight         0.25      0.18     -0.68      0.12      0.01     -0.12      0.78      1.00      0.88
bmi            0.18      0.18     -0.42      0.02      0.05      0.06      0.34      0.88      1.00

Table 2.4: Correlations of true data
                age     smoke       sex intensity    active      rest    height    weight       bmi
age            1.00     -0.05     -0.17      0.21     -0.54     -0.39      0.20      0.23      0.20
smoke         -0.05      1.00     -0.11     -0.31      0.18      0.27      0.17      0.25      0.24
sex           -0.17     -0.11      1.00     -0.09      0.09      0.06     -0.72     -0.69     -0.47
intensity      0.21     -0.31     -0.09      1.00     -0.37     -0.55      0.12      0.06      0.01
active        -0.54      0.18      0.09     -0.37      1.00      0.61     -0.10      0.02      0.09
rest          -0.39      0.27      0.06     -0.55      0.61      1.00     -0.15     -0.04      0.05
height         0.20      0.17     -0.72      0.12     -0.10     -0.15      1.00      0.77      0.36
weight         0.23      0.25     -0.69      0.06      0.02     -0.04      0.77      1.00      0.87
bmi            0.20      0.24     -0.47      0.01      0.09      0.05      0.36      0.87      1.00

Correlation description

The correlations between the variables in the observed data set differ only marginally from the correlations in the true data. Although the majority of the correlations are almost identical, the sign of a correlation can differ between the two data sets. For example, the correlation between smoke and age is positive in the observed data set (r = 0.01), albeit almost 0, while the correlation between these variables in the true data set is negative (r = -0.05).

However, the impact of missing data on the correlations appears to be small, as the differences in correlation coefficients between the two data sets are negligible. Where a correlation differs in sign between the data sets, the coefficients remain close to 0 and are thus unlikely to distort inferences made with the observed data set.
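
To make the comparison of Tables 2.3 and 2.4 concrete, the sketch below lists every pair of variables whose correlation has opposite signs in the two tables and finds the pair with the largest absolute difference. It is an illustration only: the matrices are typed in by hand from the two tables, and the names `obs_r`, `true_r`, `flips`, and `diffs` are made up for this example.

```python
names = ["age", "smoke", "sex", "intensity", "active",
         "rest", "height", "weight", "bmi"]

# Rows copied from Table 2.3 (observed data).
obs_r = [
    [1.00, 0.01, -0.17, 0.21, -0.49, -0.39, 0.19, 0.25, 0.18],
    [0.01, 1.00, -0.09, -0.29, 0.15, 0.23, 0.18, 0.18, 0.18],
    [-0.17, -0.09, 1.00, -0.09, 0.11, 0.06, -0.73, -0.68, -0.42],
    [0.21, -0.29, -0.09, 1.00, -0.37, -0.55, 0.13, 0.12, 0.02],
    [-0.49, 0.15, 0.11, -0.37, 1.00, 0.56, 0.00, 0.01, 0.05],
    [-0.39, 0.23, 0.06, -0.55, 0.56, 1.00, -0.20, -0.12, 0.06],
    [0.19, 0.18, -0.73, 0.13, 0.00, -0.20, 1.00, 0.78, 0.34],
    [0.25, 0.18, -0.68, 0.12, 0.01, -0.12, 0.78, 1.00, 0.88],
    [0.18, 0.18, -0.42, 0.02, 0.05, 0.06, 0.34, 0.88, 1.00],
]

# Rows copied from Table 2.4 (true data).
true_r = [
    [1.00, -0.05, -0.17, 0.21, -0.54, -0.39, 0.20, 0.23, 0.20],
    [-0.05, 1.00, -0.11, -0.31, 0.18, 0.27, 0.17, 0.25, 0.24],
    [-0.17, -0.11, 1.00, -0.09, 0.09, 0.06, -0.72, -0.69, -0.47],
    [0.21, -0.31, -0.09, 1.00, -0.37, -0.55, 0.12, 0.06, 0.01],
    [-0.54, 0.18, 0.09, -0.37, 1.00, 0.61, -0.10, 0.02, 0.09],
    [-0.39, 0.27, 0.06, -0.55, 0.61, 1.00, -0.15, -0.04, 0.05],
    [0.20, 0.17, -0.72, 0.12, -0.10, -0.15, 1.00, 0.77, 0.36],
    [0.23, 0.25, -0.69, 0.06, 0.02, -0.04, 0.77, 1.00, 0.87],
    [0.20, 0.24, -0.47, 0.01, 0.09, 0.05, 0.36, 0.87, 1.00],
]

flips = []   # pairs with strictly opposite signs in the two tables
diffs = {}   # absolute difference per pair
for i in range(len(names)):
    for j in range(i + 1, len(names)):   # upper triangle only
        pair = (names[i], names[j])
        if obs_r[i][j] * true_r[i][j] < 0:
            flips.append(pair)
        diffs[pair] = abs(obs_r[i][j] - true_r[i][j])

largest = max(diffs, key=diffs.get)
print("sign flips:", flips)
print("largest difference:", largest, round(diffs[largest], 2))
```

A correlation of exactly 0.00 is not counted as a sign flip here, which is a choice made for this sketch.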

2.3 Regression

Table 2.5: Regression analysis of True and Observed Data
               \(\beta_{obs}\)  \(SE_{obs}\)  \(p_{obs}\)  \(\beta_{true}\)  \(SE_{true}\)  \(p_{true}\)
(Intercept)    78.444           14.34         0.000        80.384            9.03           0.000
age            -0.809           0.11          0.000        -0.883            0.07           0.000
bmi            1.681            0.55          0.003        1.776             0.35           0.000
sexfemale      32.756           20.78         0.117        43.460            14.16          0.002
smokeyes       1.615            2.91          0.580        3.516             1.99           0.078
bmi:sexfemale  -1.131           0.88          0.199        -1.674            0.60           0.006

2.4 Answering the research question

When examining Table 2.5, we observe several differences in the beta coefficients, standard errors, and p-values. The table contains the predictors age, bmi, sex, and smoke, and the interaction between bmi and sex. Although most beta coefficients are of similar size, the beta coefficients of the observed data set are systematically smaller in absolute value than those of the true data set. This is especially the case for sexfemale, where the difference between the beta coefficients is about 10.7. When making inferences, the effect of sex on active heart rate would be underestimated.

Regarding the standard errors, the missing data caused the standard errors in the observed data set to be systematically larger. Larger standard errors increase the probability of making a type II error, as appears to be the case in our data set. The larger standard errors in the observed data set might have played a role in sexfemale and the interaction bmi:sexfemale turning non-significant. When making inferences with the model based on the observed data, these effects would wrongly be neglected.

In conclusion, the missing data cause the standard errors to be larger, resulting in less precise beta coefficients. Moreover, some p-values turn out non-significant, partly because of the underestimated beta coefficients. Thus, the model based on the observed data leads to inaccurate inferences.
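
The pattern described above can be checked with simple arithmetic on Table 2.5. The sketch below is not the original analysis code; it only recomputes, from the values copied out of the table, the difference between the coefficients, the ratio of the standard errors, and the t ratio (estimate divided by standard error) for each term. The names `table` and `summary` are illustrative.

```python
# (beta_obs, se_obs, beta_true, se_true), copied from Table 2.5.
table = {
    "(Intercept)": (78.444, 14.34, 80.384, 9.03),
    "age": (-0.809, 0.11, -0.883, 0.07),
    "bmi": (1.681, 0.55, 1.776, 0.35),
    "sexfemale": (32.756, 20.78, 43.460, 14.16),
    "smokeyes": (1.615, 2.91, 3.516, 1.99),
    "bmi:sexfemale": (-1.131, 0.88, -1.674, 0.60),
}

summary = {}
for term, (b_obs, se_obs, b_true, se_true) in table.items():
    summary[term] = {
        "beta_diff": b_true - b_obs,   # true minus observed estimate
        "se_ratio": se_obs / se_true,  # > 1 means a larger SE in the observed data
        "t_obs": b_obs / se_obs,       # t ratio in the observed data
        "t_true": b_true / se_true,    # t ratio in the true data
    }

for term, s in summary.items():
    print(f"{term:>14}: diff={s['beta_diff']:8.3f}  "
          f"SE ratio={s['se_ratio']:.2f}  "
          f"t_obs={s['t_obs']:6.2f}  t_true={s['t_true']:6.2f}")
```

Because the t ratio shrinks when the estimate moves toward zero and when the standard error grows, both effects push the observed-data tests toward non-significance.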


3 Missingness

There are 540 missing values: 0 for age, sex, intensity, and rest, 58 for smoke, 92 for height, 93 for bmi, 123 for active, and 174 for weight. Moreover, there are 132 completely observed rows, 15 rows with one missing value, 37 rows with two missing values, 52 rows with three missing values, 55 rows with four missing values, and 15 rows with five missing values.
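
As a quick consistency check on these counts (a sketch with the numbers copied from the paragraph above; the variable names are illustrative), the total can be obtained both per variable and per row:

```python
# Number of missing values per variable, as reported above.
missing_per_variable = {
    "age": 0, "sex": 0, "intensity": 0, "rest": 0,
    "smoke": 58, "height": 92, "bmi": 93, "active": 123, "weight": 174,
}

# Number of rows with k missing values, as reported above.
rows_by_missing_count = {0: 132, 1: 15, 2: 37, 3: 52, 4: 55, 5: 15}

total_by_variable = sum(missing_per_variable.values())
n_rows = sum(rows_by_missing_count.values())
total_by_row = sum(k * n for k, n in rows_by_missing_count.items())

print(total_by_variable, n_rows, total_by_row)
```

Both ways of counting give 540 missing values, and the row counts add up to the 306 cases in the data set.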

The missingness in the data is non-monotone, because cases with a missing value on smoke (the variable with the fewest missing values) still have observed values on variables with more missingness (e.g., bmi). The missingness would be monotone if every case with a missing value on smoke also had missing values on all other variables with more missingness (e.g., height). Interestingly, a monotone pattern holds only for smoke and weight.
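
The idea of a monotone pattern can be written down as a small check. The following is a minimal sketch, not the method used to produce Figure 3.1: it treats `None` as a missing value, orders the variables by their number of missing values, and tests whether the sets of incomplete rows are nested. The helper names `missing_sets` and `is_monotone` are hypothetical; the example rows are the five observed rows of Table 2.1, restricted to the variables with missing values.

```python
def missing_sets(rows, variables):
    """Map each variable to the set of row indices where it is missing."""
    return {
        v: {i for i, row in enumerate(rows) if row[v] is None}
        for v in variables
    }


def is_monotone(rows, variables):
    """True if the variables can be ordered so that missingness is nested."""
    sets = missing_sets(rows, variables)
    ordered = sorted(variables, key=lambda v: len(sets[v]))
    # Nested means: rows missing on one variable are also missing
    # on every variable that has more missing values.
    return all(sets[a] <= sets[b] for a, b in zip(ordered, ordered[1:]))


variables = ["smoke", "active", "height", "weight", "bmi"]
head = [
    {"smoke": "no", "active": None, "height": None, "weight": None, "bmi": 22.4},
    {"smoke": None, "active": None, "height": None, "weight": None, "bmi": 23.8},
    {"smoke": "no", "active": 109, "height": 182, "weight": 78.0, "bmi": 23.5},
    {"smoke": "no", "active": 78, "height": 164, "weight": 53.9, "bmi": 20.0},
    {"smoke": "no", "active": None, "height": 189, "weight": None, "bmi": 23.4},
]

# A made-up extra row: smoke is missing while everything else is observed.
extra = {"smoke": None, "active": 90, "height": 170, "weight": 70.0, "bmi": 24.2}

print(is_monotone(head, variables))
print(is_monotone(head + [extra], variables))
```

The five rows of Table 2.1 on their own happen to satisfy the check, which says nothing about the full data set; adding a single row like the made-up `extra` row is enough to break the nesting.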

Figure 3.1: pattern of the missingness




3.1 Looking for the missingness

In this section we investigate whether the mean of the missing values differs significantly from the mean of the observed values. For the numeric variables we use an independent-samples t-test. To compare the missing values with the observed values, we computed a logical vector for each variable that has missing observations. These missingness vectors have the value TRUE for all missing entries and FALSE for all observed entries, and are used as grouping variables in the true data set. For smoke, which is a categorical variable, we use a \(\chi^2\) test.

weight: \(t =\) 0.381, \(p =\) 0.704

height: \(t =\) 1.271, \(p =\) 0.205

bmi: \(t =\) 0.336, \(p =\) 0.737

active: \(t =\) -0.606, \(p =\) 0.545

smoke: \(\chi^2 =\) 1.154, \(p =\) 0.283
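
The grouping step described at the start of this section can be sketched as follows. This is an illustration under assumptions, not the code behind the reported values: the text does not say whether equal variances were assumed, so Welch's unequal-variance statistic is used here, missing entries are represented by `None`, the function names are made up, and the data are toy values rather than values from the data set.

```python
from math import sqrt
from statistics import mean, variance


def split_by_missingness(true_values, observed_values):
    """Split the true values by whether the observed entry is missing."""
    is_missing = [v is None for v in observed_values]  # the missingness vector
    missing = [t for t, m in zip(true_values, is_missing) if m]
    observed = [t for t, m in zip(true_values, is_missing) if not m]
    return missing, observed


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Toy example (hypothetical numbers).
true_values = [1, 2, 3, 4, 5, 2, 3, 4, 5, 6]
observed_values = [None, None, None, None, None, 2, 3, 4, 5, 6]

missing, observed = split_by_missingness(true_values, observed_values)
print(welch_t(missing, observed))
```

The sign of the statistic depends on which group is listed first, so it only matters in combination with the order of the groups.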

Figure 3.2: comparing the distribution of the observed and true dataset

3.2 Missingness of weight

missing weight on sex: \(\chi^2 =\) 0, \(p =\) 1

missing weight on smoke: \(\chi^2 =\) 0.036, \(p =\) 0.848

missing weight on intensity: \(\chi^2 =\) 2.589, \(p =\) 0.274

missing weight on rest: \(t =\) -0.482, \(p =\) 0.63

missing weight on age: \(t =\) -0.59, \(p =\) 0.556

missing weight on height: \(t =\) -0.639, \(p =\) 0.525

missing weight on bmi: \(t =\) -0.012, \(p =\) 0.99

missing weight on active: \(t =\) -1.44, \(p =\) 0.156

Figure 3.3: Looking whether the missingness of weight is MAR

3.3 Missingness of height

missing height on sex: \(\chi^2 =\) 0, \(p =\) 1

missing height on smoke: \(\chi^2 =\) 0.111, \(p =\) 0.739

missing height on intensity: \(\chi^2 =\) 3.563, \(p =\) 0.168

missing height on rest: \(t =\) 0.242, \(p =\) 0.809

missing height on age: \(t =\) 0.32, \(p =\) 0.749

missing height on bmi: \(t =\) -0.012, \(p =\) 0.99

missing height on active: \(t =\) -1.535, \(p =\) 0.137

Figure 3.4: Looking whether the missingness of height is MAR

3.4 Missingness of Active

missing active on sex: \(\chi^2 =\) 1.957, \(p =\) 0.162

missing active on smoke: \(\chi^2 =\) 0.293, \(p =\) 0.589

missing active on intensity: \(\chi^2 =\) 2.193, \(p =\) 0.334

missing active on rest: \(t =\) -1.558, \(p =\) 0.12

missing active on age: \(t =\) -0.963, \(p =\) 0.337

missing active on height: \(t =\) -0.232, \(p =\) 0.817

missing active on bmi: \(t =\) -1.883, \(p =\) 0.062

missing active on weight: \(t =\) -1.948, \(p =\) 0.059

Figure 3.5: Looking whether the missingness of active is MAR

3.5 Missingness of Bmi

missing bmi on sex: \(\chi^2 =\) 0.019, \(p =\) 0.889

missing bmi on smoke: \(\chi^2 =\) 0, \(p =\) 1

missing bmi on intensity: \(\chi^2 =\) 1.476, \(p =\) 0.478

missing bmi on rest: \(t =\) 0.021, \(p =\) 0.983

missing bmi on age: \(t =\) -0.368, \(p =\) 0.713

missing bmi on height: \(t =\) -0.639, \(p =\) 0.525

missing bmi on active: \(t =\) -0.717, \(p =\) 0.478

Figure 3.6: Looking whether the missingness of bmi is MAR

3.6 Missingness of Smoke

missing smoke on sex: \(\chi^2 =\) 5.037, \(p =\) 0.025

missing smoke on intensity: \(\chi^2 =\) 1.722, \(p =\) 0.423

missing smoke on rest: \(t =\) 0.779, \(p =\) 0.438

missing smoke on age: \(t =\) -1.271, \(p =\) 0.208

missing smoke on height: \(t =\) -0.347, \(p =\) 0.731

missing smoke on bmi: \(t =\) -1.338, \(p =\) 0.188

missing smoke on weight: \(t =\) -0.785, \(p =\) 0.444

Figure 3.7: Looking whether the missingness of smoking is MAR
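
The p-values of the chi-square tests reported in this chapter can be recovered from the test statistics with closed-form expressions for 1 and 2 degrees of freedom. The sketch below assumes 1 degree of freedom for the tests on sex and smoke (two groups by two categories) and 2 degrees of freedom for the tests on intensity; these degrees of freedom are not stated in the text and are inferred from the reported p-values. The function name `chi2_p_value` is illustrative.

```python
from math import erfc, exp, sqrt


def chi2_p_value(statistic, df):
    """Upper-tail p-value of a chi-square statistic for df = 1 or df = 2."""
    if df == 1:
        return erfc(sqrt(statistic / 2))
    if df == 2:
        return exp(-statistic / 2)
    raise ValueError("only df = 1 and df = 2 are handled in this sketch")


# (label, statistic, assumed df, reported p), copied from the results above.
reported = [
    ("smoke, missing vs observed", 1.154, 1, 0.283),
    ("missing active on sex", 1.957, 1, 0.162),
    ("missing smoke on sex", 5.037, 1, 0.025),
    ("missing weight on intensity", 2.589, 2, 0.274),
    ("missing height on intensity", 3.563, 2, 0.168),
    ("missing active on intensity", 2.193, 2, 0.334),
]

for label, statistic, df, p_reported in reported:
    p = chi2_p_value(statistic, df)
    print(f"{label:<30} df={df}  p={p:.3f}  (reported {p_reported})")
```

Small deviations in the third decimal are possible because the statistics themselves are rounded to three decimals.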